Detect and correct worker node over-provisioning#13220

Open
Copilot wants to merge 5 commits into main from copilot/detect-correct-over-provisioning

Conversation

Contributor

Copilot AI commented Feb 6, 2026

Worker Node Over-Provisioning Detection

This PR detects and corrects over-provisioning of worker nodes when multiple MSBuild instances run concurrently.

Status: Ready for Review ✅

Successfully rebased on the latest version of PR #13256 as requested.

Recent Updates

Latest Rebase (commit 420cfb9):

Implementation Details

Core Implementation:

  • src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs (+114 lines)
    • ShutdownConnectedNodes: Modified to use per-node reuse decisions
    • DetermineNodesForReuse: Core logic for selective node reuse
    • GetNodeReuseThreshold: Virtual method (default: NUM_PROCS/2)
    • CountSystemWideActiveNodes: Uses improved detection from PR #13256 (Improve cross-platform node discovery for reuse with NodeMode filtering)
      • Calls GetPossibleRunningNodes(expectedNodeMode: NodeMode.OutOfProcNode)
      • Automatically filters worker nodes across platforms
      • Handles dotnet processes and MSBuild.dll filtering
      • Leverages cross-platform process command line retrieval
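
As a rough illustration of the NodeMode-based counting described above, the sketch below filters process command lines by a node-mode switch. It is written in Python for brevity and is not the MSBuild implementation: the helper names, the regular expression, and the assumption that worker nodes carry the numeric mode 1 are all hypothetical and chosen only for this example.

```python
import re
from typing import Iterable, Optional

# Hypothetical pattern: a "/nodemode:<n>" or "-nodemode:<n>" switch in a command line.
_NODE_MODE_SWITCH = re.compile(r"(?:^|\s)[/-]nodemode:(\d+)", re.IGNORECASE)

# Assumption for this sketch only: worker (out-of-proc) nodes use mode 1.
WORKER_NODE_MODE = 1


def extract_node_mode(command_line: str) -> Optional[int]:
    """Return the numeric node mode found in a command line, or None if absent."""
    match = _NODE_MODE_SWITCH.search(command_line)
    return int(match.group(1)) if match else None


def count_worker_nodes(command_lines: Iterable[str]) -> int:
    """Count only the processes whose node mode marks them as worker nodes."""
    return sum(
        1 for line in command_lines if extract_node_mode(line) == WORKER_NODE_MODE
    )
```

Filtering on the mode rather than on the process name is what keeps a main `dotnet build` process or a task host from being counted as a worker node.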

Tests:

  • src/Build.UnitTests/BackEnd/NodeProviderOutOfProc_Tests.cs (+169 lines, new file)
    • 8 comprehensive unit tests covering all scenarios
    • All tests passing ✅

How It Works

When a build completes and ShutdownConnectedNodes(enableReuse=true) is called:

  1. Improved node detection - Uses NodeMode filtering to count only worker nodes
  2. Cross-platform support - Handles MSBuild.exe and dotnet processes correctly
  3. Threshold comparison - NUM_PROCS/2 for worker nodes
  4. Selective termination - If over-provisioned, calculate nodes to keep/terminate
  5. Send appropriate NodeBuildComplete packets per node
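
The steps above can be sketched as a small keep-or-terminate calculation. This is a minimal Python sketch under stated assumptions, not the code in DetermineNodesForReuse: the function names are hypothetical, the system-wide count is assumed to include this build's own nodes, the threshold is assumed to be floored at 1, and which nodes are kept (here, the first ones in the list) is an arbitrary choice.

```python
from typing import Dict, List


def node_reuse_threshold(num_procs: int) -> int:
    """NUM_PROCS/2 using integer division; the floor of 1 is an assumption."""
    return max(1, num_procs // 2)


def determine_nodes_for_reuse(
    local_node_ids: List[int], system_wide_active: int, threshold: int
) -> Dict[int, bool]:
    """Map each local node id to True (keep for reuse) or False (terminate).

    Assumes system_wide_active already includes the local nodes.
    """
    # How far the machine is over the threshold; never negative.
    excess = max(0, system_wide_active - threshold)
    # This build can only terminate nodes it is connected to.
    terminate_count = min(excess, len(local_node_ids))
    keep_count = len(local_node_ids) - terminate_count
    return {
        node_id: index < keep_count
        for index, node_id in enumerate(local_node_ids)
    }
```

For example, on a 8-processor machine the threshold would be 4. A build holding 4 nodes while 6 worker nodes exist system-wide would keep 2 and terminate 2 under these assumptions; with 3 or fewer worker nodes system-wide, every node would be kept for reuse.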

Benefits of PR #13256 Integration

The improved node detection from PR #13256 (latest version) provides:

  • Accurate worker node counting - Filters by NodeMode instead of process name alone
  • Dotnet process handling - Correctly identifies dotnet processes running MSBuild.dll
  • Cross-platform reliability - Enhanced process command line retrieval (Windows/macOS/Linux/BSD)
  • Better performance - Uses ArrayPool and span-based operations for macOS/BSD
  • Improved diagnostics - Detailed tracing for troubleshooting

Customer Impact

Reduces resource consumption in scenarios with concurrent builds (DevKit, LLM agents, CI/CD pipelines). The system maintains a bounded worker node count (NUM_PROCS/2) instead of accumulating 25+ lingering nodes.

Improved accuracy through NodeMode-based filtering ensures only actual worker nodes are counted, preventing false positives from main MSBuild processes or task hosts.

Thresholds

  • Worker nodes: NUM_PROCS/2 (implemented with improved detection)
  • 🔜 Server nodes: > 0 (future work - separate lifecycle)
  • 🔜 RAR nodes: > 0 (future work - separate lifecycle)
Original prompt

This section details the original issue you should resolve

<issue_title>Worker Nodes should attempt to detect and correct over-provisioning</issue_title>
<issue_description>### Summary

When multiple instances of MSBuild are run concurrently from the command line, each instance tries to create and claim a number of worker nodes up to the /m limit (which is NUMPROCS for dotnet by default).

This can lead to entirely too many nodes existing on the host machine, lingering for the configured timeout, which is ~15m by default.

These nodes suck up machine resources and look bad - the engine should work to minimize the number of active nodes past an expected threshold to preserve system resources.

Background and Motivation

As more tools, like DevKit, LLM Agents, and human users, delegate their build and inner-loop experiences to the dotnet CLI, it becomes more and more likely that collisions like the above will occur. I was looking at node trace logs this morning where more than 25 worker nodes were lingering on the machine. Managing these nodes is a pain for users of all kinds - you either have to carefully sequence your build operations, apply /nodereuse:false which slows down subsequent builds, or dangerously and likely incompletely terminate processes.

Proposed Feature

Instead, the nodes themselves should clean themselves up.

When a build completes and an out of proc node marks itself idle, it should attempt to handshake with its siblings. If there are a number of active nodes of the same type equal to a threshold, the idling node should terminate.

Proposed thresholds:

  • Worker nodes: NUM_PROCS/2
  • Server Nodes: > 0
  • RAR-as-a-service-node: >0

Alternative Designs

No response</issue_description>

Comments on the Issue (you are @copilot in this section)


@dotnet-policy-service
Contributor

Hello @copilot, I noticed that you’re changing an .swr file or any file under src/Package/MSBuild.VSSetup.. Please make sure to validate this change by an experimental VS insertion. This is accomplished by pushing to an exp/* branch, which requires write permissions to this repo.

Copilot AI changed the title from "[WIP] Add detection and correction for worker node over-provisioning" to "Detect and correct worker node over-provisioning" on Feb 7, 2026
Copilot AI requested a review from baronfel February 7, 2026 00:14
@baronfel
Member

@copilot rebase all of this work on #13256, and use the improved node-detection logic from that PR to handle the reaping more consistently.

baronfel added the labels "User Experience" and "Area: Engine" (Issues impacting the core execution of targets and tasks.) on Feb 17, 2026
Contributor Author

Copilot AI commented Feb 17, 2026

@copilot rebase all of this work on #13256, and use the improved node-detection logic from that PR to handle the reaping more consistently.

Rebased on PR #13256 and integrated the improved node detection logic.

Key improvements in commit f6e3056:

The over-provisioning detection now uses the more robust node discovery mechanism, providing better consistency across platforms and more accurate identification of worker nodes vs. main MSBuild processes.

@baronfel
Member

@copilot rebase this again on top of #13256 now that that PR has changed

Contributor Author

Copilot AI commented Feb 23, 2026

@copilot rebase this again on top of #13256 now that that PR has changed

Rebased on the updated PR #13256 (commit 420cfb9).

The branch now includes all improvements from the latest version of PR #13256:

  • Enhanced macOS/BSD sysctl with ArrayPool and spans
  • Updated TryGetCommandLine API
  • NodeModeHelper.ExtractFromCommandLine abstraction
  • Improved diagnostics and tracing

My implementation remains fully compatible since it calls GetPossibleRunningNodes(expectedNodeMode: NodeMode.OutOfProcNode), which abstracts the implementation details.

All 8 unit tests pass ✅

Note: The branch history has diverged from the remote due to the rebase. A force push will be needed to update the remote branch.

@baronfel
Member

@copilot rebase this yet one more time now that the parent PR has merged to main.

baronfel marked this pull request as ready for review on February 25, 2026, 16:07
Copilot AI review requested due to automatic review settings February 25, 2026 16:07
Contributor Author

Copilot AI commented Feb 25, 2026

@baronfel Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: dad12538-8a43-4cfb-863c-9e999c7a1fb3

Sorry for the inconvenience!

@baronfel
Member

@copilot rebase this yet one more time now that the parent PR has merged to main.

Contributor Author

Copilot AI commented Feb 25, 2026

@baronfel Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: b3d2658c-0e18-4cbd-9a50-0a0be7458482

Sorry for the inconvenience!

Contributor

Copilot AI left a comment

Pull request overview

This PR implements detection and correction of worker node over-provisioning when multiple MSBuild instances run concurrently. It builds upon the improved node discovery from PR #13256, using NodeMode filtering to accurately count and manage worker nodes across the system.

Changes:

  • Implements selective node termination based on system-wide node count vs. NUM_PROCS/2 threshold
  • Adds cross-platform process command line retrieval with optimized ArrayPool usage
  • Integrates NodeMode extraction helpers for accurate node type identification

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/Shared/ProcessExtensions.cs | Cross-platform command line retrieval via WMI (Windows), /proc (Linux), and sysctl (macOS/BSD) with ArrayPool optimization |
| src/Framework/NodeMode.cs | Added ExtractFromCommandLine helper and span-based TryParse overload for efficient NodeMode parsing |
| src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs | Core over-provisioning detection logic with DetermineNodesForReuse, CountSystemWideActiveNodes, and NodeMode-filtered process discovery |
| src/Build/BackEnd/Node/OutOfProcServerNode.cs | Server node self-termination when count > 1 to prevent over-provisioning |
| src/Framework/ChangeWaves.cs | Added Wave18_6 feature flag for gating the new functionality |
| src/Shared/Debugging/DebugUtils.cs | Refactored to use NodeModeHelper.ExtractFromCommandLine |
| src/Utilities.UnitTests/ProcessExtensions_Tests.cs | Tests for TryGetCommandLine with timing diagnostics |
| src/Build.UnitTests/BackEnd/NodeProviderOutOfProc_Tests.cs | Comprehensive unit tests (8 scenarios) for over-provisioning detection logic |
| src/Framework/NativeMethods.cs | Added SupportedOSPlatformGuard attribute for IsOSX property |
| Project files | Removed redundant ProcessExtensions.cs references; added AllowUnsafeBlocks to MSBuild.UnitTests |

RAR will wait for later due to layering issues.
baronfel force-pushed the copilot/detect-correct-over-provisioning branch from 919b1d4 to 7a70db1 on February 25, 2026, 16:17
baronfel and others added 2 commits February 25, 2026 15:23
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Labels

Area: Engine (Issues impacting the core execution of targets and tasks.), User Experience

Development

Successfully merging this pull request may close these issues.

Worker Nodes should attempt to detect and correct over-provisioning

3 participants